Applying active learning to high-throughput phenotyping algorithms for electronic health records data.
Abstract
OBJECTIVES Generalizable, high-throughput phenotyping methods based on supervised machine learning (ML) algorithms could significantly accelerate the use of electronic health record data for clinical and translational research. However, they often require large numbers of annotated samples, which are costly and time-consuming to review. We investigated the use of active learning (AL) in ML-based phenotyping algorithms.

METHODS We integrated an uncertainty sampling AL approach with support vector machine-based phenotyping algorithms and evaluated its performance using three annotated disease cohorts: rheumatoid arthritis (RA), colorectal cancer (CRC), and venous thromboembolism (VTE). We investigated performance using two types of feature sets: unrefined features, which contained at least all clinical concepts extracted from notes and billing codes; and a smaller set of refined features selected by domain experts. The performance of AL was compared with a passive learning (PL) approach based on random sampling.

RESULTS Our evaluation showed that AL outperformed PL on all three phenotyping tasks. When unrefined features were used in the RA and CRC tasks, AL reduced the number of annotated samples required to achieve an area under the curve (AUC) score of 0.95 by 68% and 23%, respectively. AL also achieved a reduction of 68% for VTE with an optimal AUC of 0.70 using refined features. As expected, refined features improved the performance of phenotyping classifiers and required fewer annotated samples.

CONCLUSIONS This study demonstrated that AL can be useful in ML-based phenotyping methods. Moreover, AL and feature engineering based on domain knowledge could be combined to develop efficient and generalizable phenotyping methods.
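The comparison described in the abstract can be illustrated with a minimal sketch. It assumes a classifier that returns a signed decision score per sample (for a support vector machine, the distance to the separating hyperplane). Uncertainty sampling then picks the unlabeled samples whose scores lie closest to zero, while passive learning picks at random. The function names, the toy scores, and the toy learning curves below are all hypothetical and chosen only for illustration; none of them are taken from the study.

```python
import random
from typing import List, Optional, Sequence, Tuple


def uncertainty_sample(scores: Sequence[float], pool: Sequence[int],
                       batch_size: int) -> List[int]:
    """Active learning step: return the pool indices whose decision score
    is closest to 0, i.e. the samples the classifier is least sure about."""
    ranked = sorted(pool, key=lambda i: abs(scores[i]))
    return ranked[:batch_size]


def random_sample(pool: Sequence[int], batch_size: int,
                  rng: random.Random) -> List[int]:
    """Passive learning step: return pool indices chosen uniformly at random."""
    return rng.sample(list(pool), min(batch_size, len(pool)))


def samples_to_reach(curve: Sequence[Tuple[int, float]],
                     target_auc: float) -> Optional[int]:
    """Smallest number of annotated samples at which a learning curve,
    given as (n_annotated, auc) pairs, reaches target_auc; None if never."""
    for n_annotated, auc in sorted(curve):
        if auc >= target_auc:
            return n_annotated
    return None


def reduction_percent(n_active: int, n_passive: int) -> float:
    """Percentage of annotations saved by AL relative to PL."""
    return 100.0 * (n_passive - n_active) / n_passive


# Toy decision scores for five unlabeled samples (invented values).
toy_scores = [2.1, -0.1, 0.4, -1.7, 0.05]
toy_pool = [0, 1, 2, 3, 4]
picked = uncertainty_sample(toy_scores, toy_pool, batch_size=2)

# Toy learning curves (invented values, not results from the paper).
toy_al_curve = [(50, 0.90), (100, 0.95), (200, 0.97)]
toy_pl_curve = [(50, 0.85), (100, 0.90), (200, 0.93), (400, 0.95)]
n_al = samples_to_reach(toy_al_curve, 0.95)
n_pl = samples_to_reach(toy_pl_curve, 0.95)
saving = reduction_percent(n_al, n_pl)
```

In a full loop, the picked samples would be annotated, added to the training set, and the classifier retrained before the next round of selection. The reduction figure is simply the relative difference between the annotation counts at which the two curves first reach the target AUC.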
Related resources
Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources
OBJECTIVE Analysis of narrative (text) data from electronic health records (EHRs) can improve population-scale phenotyping for clinical and genetic research. Currently, selection of text features for phenotyping algorithms is slow and laborious, requiring extensive and iterative involvement by domain experts. This paper introduces a method to develop phenotyping algorithms in an unbiased manner...
Electronic health records-driven phenotyping: challenges, recent advances, and perspectives.
With the completion of the Human Genome Project as well as recent advances in genomic science and comparative biological studies, a new era of individualized medicine is evolving where novel biomedical discoveries are leading to more effective prevention, treatment, and diagnosis of disease. Although altered phenotypes are one of the most reliable manifestations of altered gene functions, resea...
Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network
The widespread usage of electronic health records (EHRs) for clinical research has produced multiple electronic phenotyping approaches. Methods for electronic phenotyping range from those needing extensive specialized medical expert supervision to those based on semi-supervised learning techniques. We present Automated PHenotype Routine for Observational Definition, Identification, Training and...
Using Association Rule Mining for Phenotype Extraction from Electronic Health Records
The increasing adoption of electronic health records (EHRs) due to Meaningful Use is providing unprecedented opportunities to enable secondary use of EHR data. Significant emphasis is being given to the development of algorithms and methods for phenotype extraction from EHRs to facilitate population-based studies for clinical and translational research. While preliminary work has shown demonstr...
Computational Methods for Electronic Health Record-driven Phenotyping
Each year the National Institutes of Health spends over 12 billion dollars on patient-related medical research. Accurately classifying patients into categories representing disease, exposures, or other medical conditions important to a study is critical when conducting patient-related research. Without rigorous characterization of patients, also referred to as phenotyping, relationships between e...
Journal:
- Journal of the American Medical Informatics Association : JAMIA
Volume 20, issue e2
Pages: -
Publication date: 2013